AUTOMATIC CONFIGURATION OF A MICROPROCESSOR
TECHNICAL FIELD 

The present invention is in the field of digital computing systems. In particular, it relates to the
automatic configuration of a microprocessor architecture.

BACKGROUND ART 

For a new processor Instruction Set Architecture (ISA) to be successful, high quality
development tools and a wide range of applications supporting that ISA are required. Compilers
must be made available that target the architecture along with the associated libraries and
linker. A debugger is required to allow programs to be debugged while running on the
architecture. Modern debuggers need to support symbolic level operation so that code can be
executed with a view of the original source code. Software engineers expect an integrated
development environment that ties the compiler and debug tools into a powerful GUI based
environment. If software engineers cannot work in a familiar software environment then this
represents a significant barrier to the adoption of a new architecture. The development of
such an environment and associated tools represents many man years of development work
even if existing compilers and tools can be retargeted to the new architecture.

Software developed in a high level language can be recompiled for execution on a new ISA.
However, in practice, this can require significant effort. Moreover, certain types of application
software such as Operating Systems have strong architectural dependencies which make
porting to a new ISA much more difficult.

There has been a general trend within the microprocessor industry to develop new
generations of faster microprocessors that are backwardly compatible with existing ISAs. This
significantly eases the adoption of new product generations. However, supporting an existing
ISA in a new architecture creates significant hardware overhead especially if the intention is to
extract significant parallelism from code. This overhead is particularly significant for
microprocessors used within embedded systems where cost is highly significant.

It is advantageous to be able to support an existing ISA on a new microprocessor without
hardware overhead. This can be achieved using instruction set translation. The ISA of a host
microprocessor is converted into the ISA of a particular target microprocessor. There is a
significant body of prior art in the area of instruction set translation. A number of academic
and commercial systems have been built that allow binaries written for one architecture to be
executed on another. One significant challenge is achieving high enough performance on the
target architecture. The precise emulation of the idiosyncrasies of an architecture on another
significantly degrades performance.

The simplest method is interpretive emulation. A soft CPU is built on the target architecture
that is able to read and interpretively execute the instructions from the host architecture.
Unfortunately this method is very slow and inefficient and is largely impractical for use in
embedded systems. Moreover, this method does not allow the translated code to make
effective use of the particular architectural features of the target.

The majority of recent research and commercialization in this field has been in the area of
dynamic translation techniques. This method allows a very exact emulation of an architecture
to be achieved while maintaining high performance levels. As code from the host architecture
is encountered it is converted, at run time, into code for the target architecture using a
dynamic code translator. The translated code can then be stored in a cache. The translated
code can then be executed to produce the required results. If the same block of code needs to
be executed then the translated version from the cache can be used again without the need to
translate it again. In some systems an increasing amount of time is devoted to performing
optimisations on a particular code sequence in the cache if it is frequently executed. Thus the
run time system can target computationally expensive optimisations on frequently executed
code. Dynamic translation systems can provide very exact emulations of architectures, even
for events that are normally very difficult to handle in translation. For instance, self modifying
code can be handled simply by flushing any affected code from the cache. Instructions that
generate exceptions (such as memory accesses that generate a page miss) can also be handled
and produce a machine state identical to that of the host architecture. Breakpoints can be
handled as exceptions so that if a breakpoint is encountered execution can be made to stop at
a particular host instruction. Single stepping is achieved by producing translated code blocks
consisting of a single host instruction. An example of a commercially available dynamic
translation system is that provided by Transmeta Inc. They have designed a soft x86 processor
that actually runs on a VLIW architecture, by utilisation of dynamic translation techniques.
More recently Transitive Technologies have announced a more general technology that allows
dynamic translation between a number of different embedded processor architectures.

Dynamic translation is less suitable for embedded computing environments. Firstly, there is a
significant memory overhead created by the translator itself and the size of the cache required
in order to achieve good performance. Secondly, dynamic translation systems do not provide
sufficiently deterministic behaviour. Determinism is especially important for embedded real
time environments. There is a significant start up delay while code from the application is
translated into the cache. There may also be significant delays if an important block of code
becomes evicted from the cache.

There is also benefit to the end user being able to extend the ISA of a particular processor.
This enables fast custom hardware for a particular application domain to be directly accessible
from software. Some existing configurable RISC processors (such as those supplied by
Tensilica Inc and Arc Cores) have a facility to extend the instruction set. A number of unused
operation codes are made available and are used to select an added instruction. The instruction
execution logic has to be integrated into the pipeline of the processor in order to receive
operands and write results back into the register file. This integration is more automatic in the
case of the Tensilica solution. Both the Tensilica and Arc processors have their own
instruction set and tool chain. The tools can be updated so that the new instruction can be
accessed through the compiler and assembler using a user specified mnemonic.



WO 2004/003730 



PCT/GB2003/002778 



SUMMARY OF INVENTION 

This document discloses a process for automatically configuring a microprocessor using an
existing ISA. These microprocessors are targeted at embedded systems applications that
execute repetitive code that contains high degrees of potential parallelism. The
microprocessors are configured and programmed automatically by the analysis of the
application software in the form of an executable image in the ISA of a particular host
microprocessor. The configured microprocessors have a customised target ISA that is
specifically designed to exploit parallelism in the application software.

The instruction set translation operates by converting each source machine instruction into a
sequence of more basic operations to be executed on a target processor. All register reads
and writes are formed into separate operations. Thus a 3-address add operation is converted
into operations to read the left and right operand registers, the add operation itself and finally
an operation to write back the result to the register file. Instructions with complicated
addressing modes or that modify the condition codes result in longer sequences of basic
operations.

One disadvantage of prior art static translation systems is that they are unable to support
debugging using the host instruction set. It is obviously imperative to support existing
debuggers. The innovations in the translation approach are primarily in the methods to allow
such debug support. By maintaining a correspondence between the host processor state and
the coprocessor state at specified points in the execution it is possible to support host level
breakpoints on the architecture. In other words a breakpoint can be set specified by a host
instruction address and the architectural state reproduced as though the code was actually
running on the host processor.

Instruction set extension is supported by converting calls to particular software functions into
an invocation of an extension hardware unit. A hardware unit is designed that performs the
same operation as a particular software function. That is, it takes the same parameters and
produces the same results as the code in its software equivalent. The advantage of the
hardware version is that it will be able to produce the results more quickly.

The configured target microprocessors may be used as coprocessors within a system. They are
responsible for executing certain software functions translated from an executable image of
another host microprocessor. This host microprocessor will typically also be present in the
system. Mechanisms are provided to allow the host microprocessor and target coprocessors to
interact and maintain coherence between the memory systems.

BRIEF DESCRIPTION OF DRAWINGS

Figure 1 provides an illustration of a functional unit with a number of cycles latency between
the consumption of input operands and the generation of an output result.

Figure 2 provides some C code which contains an example of the use of intrinsic functions.

Figure 3 provides an overview of the hardware used to transform a target offset address into a
host address.

Figure 4 provides an example of the mapping of an indirect function call on the host
architecture to an indirect function call on the target.

Figure 5 provides an example of a direct function call on the host and how it is translated into
a call on the target.

Figure 6 provides an overview of the design flow used with the tool.

Figure 7 illustrates how collision pointers are used to handle multiple links that map to the
same address in the target code area.

Figure 8 provides an overview of the target data format.

Figure 9 shows the interactions between the host processor, coprocessor and memory in the
context of the host processor running autonomously while the coprocessor is active.

Figure 10 provides an overview of the connectivity of a host processor and coprocessors in a
system and how they are connected to a debug environment on a remote system.

Figure 11 shows the interactions between the host processor, coprocessor and memory in the
context of the host processor being blocked while the coprocessor is active.

Figure 12 provides an overview of the target address format.

DESCRIPTION OF PRESENTLY PREFERRED EMBODIMENT



Processor Synthesis Flow

The retargeting of existing compilers and debuggers for the preferred embodiment of an
automatically configured target processor is particularly problematic, as it has no fixed
instruction set. The processor of the preferred embodiment does not support any kind of
assembly language. Existing compiler and debug tools are designed to be targeted at a fixed
architecture and require extensive modification to cope with a variable architecture. A fixed
intermediate representation is needed to hide the variability of the architecture from existing
software tools.

A particular tool trajectory has been chosen for the preferred embodiment to minimise the 
amount of development effort and to promote adoption of the processor. The trajectory uses 
existing, and therefore familiar, software development tools. The fixed intermediate 
representation is in fact the machine code for an existing processor. Thus all the compiler 
tools and debug tools for that particular architecture can be used. The code generator takes an 
executable for the processor as an input and produces an executable for a customized 
processor as the output.

The translation must be able to take code generated for a host architecture and produce code
for a particular processor. This code must faithfully reproduce the same results as the original
code. However, the focus of the preferred embodiment is to provide superior performance on
particularly key parts of an application. This superior performance is obtained through the
exploitation of higher levels of parallelism. Thus the clock frequency of the system could be
lowered to achieve the same level of performance and thus reduce power consumption. Even
though this application code is expressed in sequential machine code the code generator must
be able to reorder and schedule the individual operations as required to make effective use of
the innovative architectural features supported by the preferred embodiment. Thus individual
operations may be scheduled in a completely different order to that of the original sequential
code.



The preferred embodiment consists of a number of individual tools that will be used by
engineers. The overall design chain is subdivided into these tools to provide greater flexibility
and improve interoperability with systems supplied by other Electronic Design Automation
(EDA) software vendors.

The overall tool chain and relationships between the tools is shown in Figure 6. The box 601
represents the processor generation tool. This takes as input executable code for a host
processor 608 and various configuration files 609. Alternatively the tool 601 may provide a
graphical user interface allowing direct control of configuration parameters from within the
tool. The tool reads and writes descriptions 602 of candidate processor architectures. In one
possible flow through the tool a hardware description of a processor may be generated 612.
This enters a standard hardware synthesis and place and route flow 603. In addition to the top
level hardware description produced by the tool this may also incorporate library hardware
elements 604. The output of the flow 603 is hardware 614 that can be used to construct a
target system 605.

In another possible flow the tool 601 may generate microcode 613 for a processor that has
been previously generated. The architecture of the processor will be stored in an architecture
description 602.

Alternatively, the tool may be used to generate cycle accurate models of the hardware 615.
Advantageously, these may be generated before hardware generation 612 to allow accurate
simulation of a processor architecture before committing the design to the hardware flow 603.
The models may be generated as native code that may be run on a host machine. The code
may be compiled using the compiler 610 along with software models 607 of the hardware
blocks 604. The resulting software may be linked using a linker 611 to produce a simulation
606. This simulation may also be linked with other models to provide a complete system level
model if required.

Instruction Extension

The purpose of the software/hardware interface is to allow engineers to specify the boundary
between hardware and software in a system. Software languages do not normally have to
provide any facility for specifying that boundary. It is an intrinsic assumption that the
underlying processor hardware will support a set of basic operations (such as addition,
memory access etc) that makes implementation of the software possible. All software is
converted into a sequence of such operations by a compiler.

The preferred embodiment also has intrinsic hardware support for such operations. The
support covers the hardware units required for implementation of all the instructions for the
processor machine code used as input to the tools. However, the hardware units in the
processor may be extended as required to implement more specialised functions for particular
operations. Effectively, the processor of the preferred embodiment has a completely
extensible instruction set.

Any particular software function may be annotated to indicate that it is actually implemented
in hardware. Software functions are used as an abstraction for an operation actually
implemented in hardware. Such functions must not have any side effects, such as the
alteration of global variables or other areas of memory, since such operations cannot have a
direct implementation in a hardware unit. Each function takes a number of parameters and
produces one or more results based directly on those input parameters.

Calls to certain functions can be replaced with uses of a user specified block of hardware. The
engineer adds the function to a list of functions that are implemented in hardware. The
hardware function is given the same name as the equivalent software function. During
translation, whenever a call to the software function is encountered it is converted into an
invocation of the hardware unit. The parameters that would be passed to the software
function are passed as the operands to the hardware unit. The results that would have been
returned by the software function are obtained as the results from the hardware unit. In this
way the effective instruction set of the preferred embodiment processor can be extended as
required. The hardware functions can be accessed directly from a high level language just by
calling the appropriate function. Moreover, this can be achieved without having to modify or
extend the fixed instruction set of the host architecture.

Figure 2 shows an example usage of the intrinsic mechanism. The example provides a
hardware implementation of a bit counting function 201. This can be performed very
efficiently in hardware but is much more time consuming in a software implementation. The
bit count function is called in the code segment 202. If the bit counting function is marked as
being intrinsic then the hardware unit will be used when code 202 is targeted through the tool.

The example illustrates the power of the methodology. Within a few lines of code the user has
defined a custom instruction for a processor. There is no need to resort to assembly language
or any complex definition language. What is more, the program is completely standard
C/C++ and is easily readable by any programmer.
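
The substitution of a hardware unit for a software function can be sketched as follows. This is
only an illustrative model written in Python, not the C code of Figure 2, which is not
reproduced in this text. The names bit_count, HARDWARE_FUNCTIONS and translate_call are
hypothetical and chosen for the example; the tuple encoding of operations is likewise an
assumption.

```python
def bit_count(word: int) -> int:
    """Behavioural (software) model: count the set bits of a 32-bit word."""
    word &= 0xFFFFFFFF
    count = 0
    while word:
        word &= word - 1  # clear the lowest set bit
        count += 1
    return count


# Hypothetical list of functions the engineer has marked as implemented in hardware.
HARDWARE_FUNCTIONS = {"bit_count"}


def translate_call(function_name: str, argument_registers: list) -> tuple:
    """Emit a hardware-unit invocation for marked functions, else an ordinary call."""
    if function_name in HARDWARE_FUNCTIONS:
        return ("hw_unit", function_name, tuple(argument_registers))
    return ("call", function_name, tuple(argument_registers))
```

In this sketch the software body of bit_count plays the role of the behavioural model, while
translate_call stands in for the decision the translator makes at each call site.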

User defined hardware units do not have to be purely computational in nature. For instance,
functions can be written to read and write to an array. This corresponds to an additional
memory unit in the hardware of the processor. This is especially useful if extra memory units
need to be defined to improve overall memory bandwidth for certain applications.

The software form of a function that is implemented in a hardware unit forms a behavioural
model. That is, it describes the operation of the execution unit. The behavioural code is
expected to produce exactly the same results that the real hardware would. Such code is
executed during simulation. The code may access I/O or library functions that would not be
present on a target system. This allows the easy capture of trace information from execution
units. Particular execution units might represent I/O units in the real system. These units can
generate the appropriate stimulus required for the simulation.

The actual implementation of the hardware is generated separately from the behavioural
implementation. Any development methodology may be employed as long as the behavioural
model and hardware implementation remain equivalent. Normally, an implementation is
obtained by rewriting the software version into HDL. It can then be synthesized to generate
an actual hardware implementation. Each execution unit only implements a fine grain
component in the overall system so they are simple to verify.

Execution Unit Model

A single hardware execution unit may implement one or more individual functions. Each of
these functions is termed a method of the unit. This corresponds to the terminology of a class
encapsulation used in C++. Indeed, if C++ is used as the language to program the
architecture then classes may be directly used to model a hardware unit, with the function
members corresponding directly to these methods.

Figure 1 shows the basic model of an execution unit 103. The underlying model of an
execution unit is as a synchronous, pipelined unit. This fits well with the computational model
of units within a processor. A unit is able to accept a number of operands 102 on a particular
clock cycle and will produce result(s) 104 a number of clock cycles later. This delay is referred
to as the latency of the unit and is illustrated as 106. It is expected that the unit is able to
accept a new operation on every clock cycle. If necessary a blockage can be set for the unit
that prevents it accepting another operation for a certain number of clock cycles after the last
one. Operand data 101 is supplied from other execution units in the architecture and result
data 105 are supplied to other units. The widths of operands and results are fully configurable.
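
The latency and blockage behaviour described above can be modelled with a few lines of
Python. This is a simplified sketch under stated assumptions: the class name ExecutionUnit and
its issue and tick methods are hypothetical, a single result per operation is assumed, and the
latency is taken to be at least one cycle.

```python
from collections import deque


class ExecutionUnit:
    """Toy model of a synchronous, pipelined unit (names are illustrative only)."""

    def __init__(self, function, latency, blockage=0):
        self.function = function    # behavioural model of the unit
        self.latency = latency      # cycles from operand consumption to result (>= 1)
        self.blockage = blockage    # extra cycles during which no new operation is accepted
        self.cycle = 0
        self.busy_until = 0         # first cycle at which a new operation may be issued
        self.in_flight = deque()    # (ready_cycle, result) pairs in issue order

    def issue(self, *operands):
        """Try to start an operation on the current cycle; return True if accepted."""
        if self.cycle < self.busy_until:
            return False
        result = self.function(*operands)
        self.in_flight.append((self.cycle + self.latency, result))
        self.busy_until = self.cycle + 1 + self.blockage
        return True

    def tick(self):
        """Advance one clock cycle; return a result if one becomes available, else None."""
        self.cycle += 1
        if self.in_flight and self.in_flight[0][0] == self.cycle:
            return self.in_flight.popleft()[1]
        return None
```

With a blockage of zero the model accepts one operation per cycle, so several operations can
be in flight at once, each emerging a fixed number of cycles after it was issued.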

Code Translation

The translation must be able to take code generated for a particular host architecture and
produce code for a target processor. This code must faithfully reproduce the same results as
that host machine code. However, the focus of the preferred embodiment is to provide
superior performance on particularly key parts of an application.

In the translated code certain sequences of operations may be considered to be atomic. That
is, the execution of the target processor will never stop part way through such a sequence.
Therefore any intermediate processor state occurring during the execution of such an atomic
block cannot be visible externally to the target processor. Such a sequence is hereafter referred
to as an atomic block.

Register Representation

A processor of the preferred embodiment has a central register file that is used to hold values
that are written to registers in the host code. There is largely a one-to-one correspondence
between these registers and those present in the host architecture.

Only those register values that are live at the conclusion of an atomic block need to be stored
into the corresponding register. A register is live if it might be subsequently read in the program.
Temporary uses of registers within an atomic block do not have to be reproduced in the
register file. Thus the amount of register file traffic can be significantly reduced in comparison
to the host architecture.

The main registers are directly equivalent to those present in the host architecture. The
majority of RISC host architectures have either 16 or 32 registers of 32 bit width. The same
number of registers are present in the target processor register file.

Typically a host processor will have a condition code register. This holds status bits generated
as a result of certain arithmetic or other operations, such as carry and overflow etc. These
status bits must also be preserved in a central register. Again, however, they only need to be
preserved if the register is live at the end of an atomic block.
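
The rule that only live values are written back at the end of an atomic block can be sketched
as below. This is an illustrative assumption about how such a filter might look, not the
method of the preferred embodiment: the function name final_writes, the (destination, sources)
encoding of operations and the use of "cc" for the condition code register are all
hypothetical.

```python
def final_writes(block, live_out):
    """Return (register, defining_index) pairs that need a register-file write.

    block    -- list of (destination, sources) pairs in program order
    live_out -- set of registers that might be read after the atomic block
    """
    last_definition = {}
    for index, (destination, _sources) in enumerate(block):
        if destination is not None:
            last_definition[destination] = index  # later definitions supersede earlier ones
    return [(register, index)
            for register, index in sorted(last_definition.items())
            if register in live_out]
```

A register that is defined and consumed entirely within the block, and is not in live_out,
produces no write at all in this model.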

Instruction Translation

This section describes how individual instructions in the host architecture are translated into
sequences of operations for execution on the target architecture. The descriptions are based
on the mechanisms used for translating a typical RISC instruction set. The general techniques
for translating one instruction set to another are well known in the prior art.

Branches 

The branch itself is translated into two separate operations. Firstly there is an immediate load
that sets the destination address. The actual value is set when the final binary is being written
and the exact address has been determined. The second operation is the branch itself. The
immediate value is passed to the branch unit. This value is then passed to a branch control
unit.

Hardware Function Calls 

If the host code contains a call to a function marked as being implemented in hardware then
the call is translated into a use of the hardware unit. The software parameters are passed as the
operands of the hardware operation. There will be a direct correspondence between the
software function parameters and those that must be passed to the hardware unit.

The Application Binary Interface (ABI) of the host architecture will define how parameters
must be passed to a function. This information is used during the translation process so that
the locations of the parameters are known. In general the first few parameters are passed in
fixed registers and later parameters passed in fixed locations on the stack frame.

Code is generated to read each of the required parameters from the appropriate register. Later
parameters are read from stack frame locations as required. These loaded parameters are then
passed as operands to the hardware method.

If the software function provides a return result then this must be emulated from the
hardware call. A function call result is normally returned in a particular fixed register. Code is
generated to copy the result from the appropriate result port of the hardware unit to the
register.

Some parameters may be marked as output parameters corresponding to pointers (or
reference parameters) to hold results from the function. Code is generated to obtain the
parameter, representing the destination address, and generate a store of the result port to the
address. The wrapper code generated around the use of the hardware unit thus allows the
hardware unit to provide the same behaviour as a software function implementation.
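
The wrapper generation described in this section might be sketched as follows. The ABI
details are assumptions made for the example only: four argument registers r0 to r3, a
result register r0, 4-byte stack slots, and result ports numbered from zero with the return
value on port 0. The function name translate_hardware_call and the operation tuples are
hypothetical.

```python
ARG_REGISTERS = ["r0", "r1", "r2", "r3"]  # assumed: first four parameters in registers
RESULT_REGISTER = "r0"                     # assumed: fixed return register
WORD_SIZE = 4                              # assumed: stack slot size in bytes


def translate_hardware_call(unit, param_count, has_result=True, output_params=()):
    """Build the wrapper operations around one use of a hardware unit."""
    ops, operands, outputs = [], [], []
    for index in range(param_count):
        temp = f"t{index}"
        if index < len(ARG_REGISTERS):
            ops.append(("read_reg", temp, ARG_REGISTERS[index]))
        else:
            offset = (index - len(ARG_REGISTERS)) * WORD_SIZE
            ops.append(("load_stack", temp, offset))
        if index in output_params:
            outputs.append(temp)    # holds a destination address, not an operand
        else:
            operands.append(temp)
    ops.append(("hw_invoke", unit, tuple(operands)))
    port = 0
    if has_result:
        ops.append(("write_reg", RESULT_REGISTER, (unit, port)))
        port += 1
    for address_temp in outputs:
        ops.append(("store", address_temp, (unit, port)))
        port += 1
    return ops
```

For a one-parameter function such as a bit count, the wrapper reduces to a register read, the
hardware invocation and a write of the result port to the return register.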

Software Function Calls 

A software function call is similar to a branch operation except that a link register is set prior
to the call. The link register holds the return address from the call. In the host instruction the
link register may be implicitly set from the next PC value as part of the instruction operation.

In the translated version the link register is loaded with an immediate value representing the
address of the instruction following the call in the original program. This is the return location
and can be mapped via an address link in the translated image. The immediate value is written
to the link register prior to the actual call. The call is implemented as a load of the destination
address, followed by a branch operation.
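
The branch and call translations described above can be sketched together. The sketch assumes
a fixed 4-byte host instruction size and a link register named lr; the function names, the
symbolic ("label", name) placeholder and the resolve step are hypothetical and only illustrate
that the destination value is filled in once the final addresses are known.

```python
def translate_branch(target_label):
    """A branch becomes an immediate load of the destination plus the branch itself."""
    return [("load_imm", "dest", ("label", target_label)), ("branch", "dest")]


def translate_software_call(target_label, call_address, instruction_size=4):
    """A call first sets the link register to the host return address, then branches."""
    return_address = call_address + instruction_size  # instruction following the call
    return [("load_imm", "ret", return_address),
            ("write_reg", "lr", "ret")] + translate_branch(target_label)


def resolve(ops, label_addresses):
    """Replace symbolic destinations once exact addresses have been determined."""
    resolved = []
    for op in ops:
        if op[0] == "load_imm" and isinstance(op[2], tuple):
            op = (op[0], op[1], label_addresses[op[2][1]])
        resolved.append(op)
    return resolved
```

Keeping the destination symbolic until resolve is called mirrors the statement that the actual
value is only set when the final binary is written.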

Data Processing Instructions 

A particular host architecture will support a number of data processing operations. For a
RISC architecture these will typically use a 3-address format where a left and right operand is
specified along with a destination register. Some operations (such as compares and tests) do
not actually cause a write-back to a register. Addressing modes may be available to allow
immediate, register or shifted values to be specified, for instance. The instructions may
optionally write to the condition code register.

The individual instructions are translated into a number of separate operations on the target
architecture. The sequence of operations required is dependent upon any addressing mode
used. Code is first generated to load operands from the central register file. This is followed
by the translated data processing operation itself. The majority of instructions map to a single
data processing operation. If required then an operation is generated to write the result back
to the destination register. If the instruction updates the condition codes then further
operations are generated to update the affected condition code registers. Thus a sequence of
operations is generated that produce the same effect as the original host instruction.

Thus a single host instruction is translated into a number of individual operations. However,
in general the later code optimisation phase will be able to eliminate many of the register file
accesses to allow operands to be passed directly between the functional units.
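
As an illustration of the expansion described above, the following sketch turns one 3-address
host instruction into its constituent operations. The function name translate_data_op, the
temporary naming scheme and the update_cc operation are hypothetical; shifted operands are
omitted, and an integer right operand is treated as an immediate.

```python
def translate_data_op(index, opcode, dest, left, right, sets_flags=False):
    """Expand one host data processing instruction into basic target operations.

    index -- position of the host instruction, used to keep temporaries unique
    dest  -- destination register, or None for compares and tests
    right -- a register name, or an int for an immediate operand
    """
    a, b, res = f"a{index}", f"b{index}", f"res{index}"
    ops = [("read_reg", a, left)]
    if isinstance(right, int):
        ops.append(("load_imm", b, right))
    else:
        ops.append(("read_reg", b, right))
    ops.append((opcode, res, a, b))
    if dest is not None:
        ops.append(("write_reg", dest, res))  # write-back only when required
    if sets_flags:
        ops.append(("update_cc", res))        # further operation for condition codes
    return ops
```

A later optimisation phase could then remove a write_reg and a following read_reg of the same
register when the value is not live outside the atomic block, passing the temporary directly.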

Any read of the PC register (if architecturally visible) is handled specially. Such an operation is
generally used to calculate the address of a data item in a position independent manner. The
full immediate value after addition is calculated and then a single operation is generated to load
it via an immediate unit.

Memory Access Instructions

Typically memory access instructions may support a number of addressing modes. The code
sequence generated is dependent upon the address mode used for the host instruction. Some
modes allow an address to be automatically incremented or decremented as part of the access
instruction without the requirement for additional address update instructions.

These addressing modes and updates must be subdivided into their constituent operations.
The memory access unit uses the final computed address as its operand. In the case of
pre-indexing the address is calculated and then written back to the base register if required. This
address is then used for the access. In the case of post-indexing the address is simply formed
from reading the base register. The access is performed and then the full address is calculated
and written back to the base register.
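
The subdivision of pre-indexed and post-indexed accesses can be sketched as follows. The
function name translate_memory_access and the operation tuples are hypothetical, and only a
single immediate offset is modelled.

```python
def translate_memory_access(kind, data_reg, base_reg, offset, mode, writeback=True):
    """Expand a load or store with pre- or post-indexed addressing.

    kind -- "load" or "store"
    mode -- "pre" (address computed before the access) or "post" (after it)
    """
    if mode not in ("pre", "post"):
        raise ValueError("mode must be 'pre' or 'post'")
    ops = [("read_reg", "base", base_reg)]
    if mode == "pre":
        ops.append(("add_imm", "addr", "base", offset))
        if writeback:
            ops.append(("write_reg", base_reg, "addr"))
        access_address = "addr"
    else:
        access_address = "base"   # post-indexing uses the unmodified base
    if kind == "load":
        ops.append(("load", "data", access_address))
        ops.append(("write_reg", data_reg, "data"))
    else:
        ops.append(("read_reg", "data", data_reg))
        ops.append(("store", access_address, "data"))
    if mode == "post":
        ops.append(("add_imm", "addr", "base", offset))
        ops.append(("write_reg", base_reg, "addr"))
    return ops
```

In both modes the memory operation itself receives a single, fully computed address, which
matches the requirement that the memory access unit uses the final address as its operand.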

Block Memory Instructions

The block memory instructions allow multiple words to be loaded or stored to memory with a
single host instruction. The behaviour of such an instruction is unusual in that it does not
conform to the general principles of RISC instruction implementation. It takes a variable
number of clock cycles to execute depending upon the number of registers that need to be
stored or loaded. The multiple word access instructions are commonly used in function
prologues and epilogues to save and restore volatile registers on the stack frame.

Such block memory instructions are translated into multiple operations in the target
architecture. The base register is read and then for each individual access (as determined by
the register list in the host instruction) a memory operation is generated. An individual
addition to the base address, using an immediate offset, is generated for each access.
Individual offsets are generated rather than continually incrementing/decrementing a single
address value. This improves freedom to allow the memory accesses to be more easily issued
in parallel with other operations.
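
A sketch of this expansion for a block store is given below. It assumes 4-byte words and
ascending addresses starting at the base; the function name translate_block_store and the
operation tuples are hypothetical.

```python
def translate_block_store(base_reg, register_list, word_size=4):
    """Expand a multi-register store into independent per-word operations."""
    ops = [("read_reg", "base", base_reg)]  # the base register is read once
    for position, register in enumerate(register_list):
        address = f"addr{position}"
        data = f"data{position}"
        # Each address is base + immediate offset, so no access depends on another.
        ops.append(("add_imm", address, "base", position * word_size))
        ops.append(("read_reg", data, register))
        ops.append(("store", address, data))
    return ops
```

Because every address is derived directly from the single base value, there is no dependency
chain between the accesses and a scheduler is free to issue them in any order.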

Translated Code Storage 

The static translation process occurs as a post-link operation. The intention is that this is called
automatically from the host software development environment. If the software IDE does
not support the calling of a post-link operation then a script can be used that incorporates
both the link and the call of the processor code generation tool.

Since the tool is run after the linker it operates on a complete executable. There are no
unresolved references and the locations for all data sections are determined. No support is
provided for any kind of dynamic linking, as such support is less important in embedded
development environments.

The executable image provided to the tool should not be stripped of the function symbols for
the functions that are being translated. If necessary all other symbols may be stripped from the
executable image in order to save space.

The translation takes the executable image and generates a new executable image that contains
the translated code. A new section is simply appended to the executable. From the perspective
of the host processor this is simply a static data section. It contains all of the translated code
for the target processor. Since exactly the same format is retained for the executable image, the
standard tools can be used to download both the host processor and the targeted processor
image to the system. Moreover, the image can be read as normal by debuggers in order to
support symbolic debug.

The appended Target Code Area (TCA) is a contiguous block of memory that holds the code
for the target processor. It also holds a mapping table that is used to transform host addresses
into target code addresses. This mapping mechanism is required for making debug of
generated processors compatible with existing host processors.

Target Code Area Base Address

The TCA has a base address within the virtual address space of the host processor. This may
be explicitly set as a configuration parameter or, alternatively, an address may be selected that
follows on from the end of the existing program section.

The base address is stored within a data table within the executable so that the host processor
is able to store the base address into a target processor register named Target Code Base
(TCB). This allows host to target address mappings to be performed.

Target Code Area Size

The size of the TCA depends on the amount of target microcode that needs to be translated
for the processor. The TCA size is automatically scaled to a suitable size. The size of the TCA
influences the setting of the Target Code Mask (TCM). The TCM must be a mask that causes
host addresses to be mapped within the TCA. Thus the number of set bits within the TCM
represents the power of 2 size that is just smaller than the actual size of the table. The
reachable size of the TCA is made as large as possible to reduce the probability of address
collisions.
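
One way of deriving such a mask from the TCA size is sketched below. The function name is
hypothetical, and the sketch assumes that a TCA whose size is exactly a power of 2 is fully
reachable and that the two least significant mask bits are clear because host addresses are
word aligned.

```python
def target_code_mask(tca_size_bytes: int) -> int:
    """Return a TCM covering the largest power-of-2 region that fits in the TCA."""
    if tca_size_bytes < 4:
        raise ValueError("the TCA must hold at least one 32 bit word")
    # Largest power of 2 that does not exceed the TCA size.
    reachable = 1 << (tca_size_bytes.bit_length() - 1)
    # Clear the low 2 bits: host addresses are assumed to be word aligned.
    return (reachable - 1) & ~0x3
```

For a TCA of 0x5000 bytes the reachable region is 0x4000 bytes, giving a mask of 0x3FFC.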

Most of the words within the TCA are used to hold microcode for the processor. These
words are 32 bits in width even though the actual execution word size of the processor may
be wider. Individual execution words are subdivided into 32 bit words for storage within the
TCA. A type tag stored within each word allows microcode and other data types to be
interspersed.

Certain words within the TCA are used to hold address mappings. These are present to
support the transformation of host addresses into target addresses. Such transformations are
required in order to allow function returns and indirect function calls using host addresses.
When used for this purpose the mapping is referred to hereafter as an Address Link. The
mappings are also accessed by the debug unit when it needs to map a host breakpoint address
into an equivalent target code address. Such a mapping is referred to hereafter as a Debug
Link. The mappings must be placed at particular locations in the TCA, since they are part of a
hash table. Thus other data types are placed around the mappings. A type tag stored within
each word allows mappings and other data types to be interspersed.

Mapping Process

Figure 3 illustrates the mapping process that is used to transform an input host address into an
address within the Target Code Area. This is used for accessing the Address Link and Debug
Link information from a host address.

Firstly, the host address 301 is masked with the Target Code Mask (TCM) 302 using the
hardware 303. This masks off the address so that it is within the size range of the target area.
The number of least significant bits that are set in the TCM will be dependent upon the target
area size. The lower 2 bits of the TCM are always reset, since all host instructions, and hence
all supplied host addresses, are word aligned.

The masked value is then added to the Target Code Base (TCB) 304 using the adder 306. This
is a fixed base value that gives the location of the Target Code Area in the virtual address
space of the host processor. It is set via a register within the Bus Interface Unit. After the
addition the address 305 will be within the range of the Target Code Area.
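
The mask-and-add mapping can be expressed in a few lines. The function name and the numeric
values in the usage note are hypothetical and chosen only for illustration.

```python
def map_host_address(host_addr: int, tcm: int, tcb: int) -> int:
    """Map a host address to the address of its slot in the Target Code Area."""
    masked = host_addr & tcm   # keep only the index bits selected by the TCM
    return tcb + masked        # relocate the index into the Target Code Area
```

With a TCM of 0x3FFC and a TCB of 0x00100000, the host address 0x00012348 maps to 0x00102348.
The host address 0x00016348 maps to the same location, since the two addresses differ only in
bits that the TCM masks off.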

Address Linking

The address linking mechanism allows host addresses to be used for indirect function calls
and function call returns. By using the host addresses the data stored by the target processor is
compatible with existing debuggers.

Function Entry Address Link

The function entry address link mechanism allows indirect calls to be made using the host
addresses of functions. Indirect function calls are explicitly supported in most high level
languages.

Figure 4 illustrates how the mechanism works. A translation must be made dynamically
between the host code address space 401 and the address of microcode within the target code
area 402. The host code performs an indirect function call 407 using a calculated function
address. The destination function is shown as 404. For instance, this may be as a result of a
virtual function call in C++ where the function pointer is obtained from the virtual function
table for the object in question. In general it is not possible to determine what set of functions
any given indirect call might reach. The code analysis must assume that any indirect call can
reach any function anywhere in the code image.

If the function has been translated to the target processor then it will have an address link 408
associated with it. This allows indirect function calls to be made between functions on the
target processor. The address link contains the address 405 of the translated form of the
function 406. Whenever there is an indirect function call in the translated code a special
address link operation is performed first. This performs a mapping 403 from the host
function address to the target address. An indirect call can then be made to the destination
pointed to by the link. Thus all indirect function calls are made doubly indirect in order to
reach the translated form of the function. If the link mapping does not access a suitable
address link entry then that indicates that an indirect function call is being made to a function
that has not been translated.
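
The doubly indirect call can be modelled as below. This is a sketch only: the TCA is
represented as a dictionary from slot address to an entry, and each entry stores the whole
host address for comparison, which is an assumption standing in for whatever comparison bits
the real link format holds. The function name and the "host"/"target" result labels are
hypothetical.

```python
from typing import Dict, Tuple

# Slot address -> (host address the link belongs to, target code address).
AddressLinks = Dict[int, Tuple[int, int]]


def resolve_indirect_call(host_func: int, links: AddressLinks,
                          tcm: int, tcb: int) -> Tuple[str, int]:
    """Return ("target", addr) if a matching address link exists, else ("host", addr)."""
    slot = tcb + (host_func & tcm)           # first indirection: locate the link
    entry = links.get(slot)
    if entry is None or entry[0] != host_func:
        # No suitable link: the callee has not been translated.
        return ("host", host_func)
    return ("target", entry[1])              # second indirection: call via the link
```

A result of ("host", addr) corresponds to the case where the called function has not been
translated, so the call cannot be completed on the target processor.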

Return Address Link 

The function call address link mechanism allows a host address, which would be used in the
original untranslated program, to be used in the translated version. The return address is
loaded into the link register by a call instruction in the host code image and this value is
architecturally visible. The link register is preserved on the stack frame if the callee function
makes any further calls. The debugger reads these preserved link values in order to generate a
stack trace back and show the location of the calling points represented on the stack. Thus to
maintain compatibility with debuggers the host link address must be used.

The return address link mechanism is illustrated in Figure 5. In the host code address space
501 a call 504 is made. This call will load a return address for the instruction following the call
505. That is the address to which execution returns after the call. In the translated code image
that return host address 505 has an address link entry 510 associated with it. The address link
points to the translated form of the instructions following the original call 509. The translated
version of the call 508 explicitly loads the link register with the address of the following host
instruction 505, in the same manner as the original code. In the callee function (not shown),
the return instruction (which is essentially an indirect branch to the link register) is converted
into an address link operation followed by an indirect branch. The address link operation uses
the mapping 506 to obtain the content of the address link. The following
indirect branch then initiates execution at 509 after the translated call site.

This mechanism allows the host return addresses to be used and thus full compatibility
maintained with debuggers for the host architecture. The only cost is the requirement to
explicitly load the link register with an immediate address before a call and an extra map link at
the point of a function return.

Debug Linking

Debug Links are placed into the Target Code Area in order to support the debug of translated
code. There is at least one debug link for each atomic block in the target code. Thus the
number of debug links will generally be much greater than the number of address links in the
Target Code Area. They provide a mapping from a host address to a particular execution
word. That execution word represents the start address of an atomic block.

By providing debug links at atomic block granularity it is possible to provide breakpoints that
are only activated if a particular path through the code is taken. Each atomic block represents
a particular sequence of conditionally executed code. Only one debug link needs to be
provided for each atomic block since the breakpoint can occur at the start of the atomic block
and then code can be executed on the host processor to advance the execution point to the
exact breakpoint. This significantly reduces the number of links that are required in the Target
Code Area.

Link Collisions

Address and Debug Links are placed at locations in the Target Code Area that are determined
by the least significant bits of the host address. This is a simple hash table representation that
simply uses these bits as the hashing function. Given this address scheme it is possible that
multiple Address or Debug Links may map to the same location in the Target Code Area.
Thus a mechanism is required to handle such collisions. The Target Code Area is made as
large as possible to reduce the number of collisions.

A link collision example is shown in Figure 7. The host code address space is shown 701 with
the requirement for two address links associated with the instructions 703. Both of these
instruction addresses map 704 to the same address link 710. These addresses map to the same
location in the Target Code Area because the host addresses share all the same least significant
bit values that are not masked by the TCM.

The collision is detected and a Collision Pointer 710 is placed in the Target Code Area 702.
The purpose of the collision pointer is to point to another area of memory in the Target Code
Area that holds all the Address or Debug Links that mapped to the same initial location. The
upper bits of the Collision Pointer 709 hold a count of the total number of entries in the
indirect collision sequence. The Collision Pointer has an offset address 708 to the indirect
sequence of links 705 via the address 707. The indirect sequence itself consists of a number of
Address or Debug Links. They are marked as Address or Debug Links via their tags 706.
These have a special flag bit indicating that they are obtained indirectly via the Collision
Pointer. This avoids them being incorrectly used as Address or Debug Links for the
locations to which they are allocated. All of the indirect Address or Debug Links are
considered to be associated with the host address of the Collision Pointer. Note that the
indirect sequence of links may be interspersed with direct links. The two can be differentiated
by use of the flag bit.
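
A lookup that follows a Collision Pointer can be sketched as follows. The Link and
CollisionPointer records, their field names and the dictionary model of the TCA are all
hypothetical. In particular the sketch stores the whole host address in each link for
comparison and treats the Collision Pointer offset as a byte offset from the TCA base, both
of which are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union

WORD_BYTES = 4


@dataclass
class Link:
    host_addr: int          # stands in for the comparison bits of a real link
    target_addr: int
    indirect: bool = False  # flag bit: only reachable via a Collision Pointer


@dataclass
class CollisionPointer:
    count: int   # number of entries in the indirect sequence
    offset: int  # assumed: byte offset from the TCA base to the sequence


def lookup(host_addr: int, tca: Dict[int, Union[Link, CollisionPointer]],
           tcm: int, tcb: int, tca_size: int) -> Optional[int]:
    """Return the target address linked to host_addr, or None if there is none."""
    entry = tca.get(tcb + (host_addr & tcm))
    if isinstance(entry, Link):
        # An indirect link found at its own slot must not be used directly.
        if entry.indirect or entry.host_addr != host_addr:
            return None
        return entry.target_addr
    if isinstance(entry, CollisionPointer):
        remaining = entry.count
        for addr in range(tcb + entry.offset, tcb + tca_size, WORD_BYTES):
            if remaining == 0:
                break
            candidate = tca.get(addr)
            # Direct links interspersed in the sequence are skipped.
            if isinstance(candidate, Link) and candidate.indirect:
                if candidate.host_addr == host_addr:
                    return candidate.target_addr
                remaining -= 1
    return None
```

The count held in the Collision Pointer bounds the search, so a host address that shares the
slot but has no link of its own is rejected once all indirect entries have been examined.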

Target Data Format (TDF)

This section describes the different types of data that can be represented in the Target Code 
Area. This is illustrated in Figure 8 showing the possible TDF types. Each of the data types is 
32 bits in size and is distinguished using a 2 bit type tag 814. 

Type 801 represents a word of microcode stored in 805. Type 802 represents an address link.
The bits 808 provide an offset in the target code area. The number of bits allocated to 808
depends on the size of the target code area. The bits 807 provide a tag comparison against a
number of the bits not used to index the location in the target code area. The bits 806 provide
various control attributes of the destination code. Type 803 represents a debug link. It has a
very similar format to an address link. Bits 811 provide the offset, bits 810 are for comparison
and bits 809 provide control attributes.
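
Packing and unpacking a link word of this general shape might look as follows. The field
widths, the field ordering (type tag in the least significant bits) and the tag values are
assumptions made for the example; the actual layout is defined by Figure 8 and, for the
offset field, by the size of the target code area.

```python
from typing import Tuple

TAG_BITS = 2        # from the text: a 2 bit type tag
OFFSET_BITS = 16    # assumption: depends on the target code area size
COMPARE_BITS = 10   # assumption
CONTROL_BITS = 4    # assumption

TAG_ADDRESS_LINK = 0b10  # hypothetical tag value
TAG_DEBUG_LINK = 0b11    # hypothetical tag value

_WIDTHS = (TAG_BITS, OFFSET_BITS, COMPARE_BITS, CONTROL_BITS)


def pack_link(tag: int, offset: int, compare: int, control: int) -> int:
    """Pack the four fields into one 32 bit word, tag in the lowest bits."""
    word, shift = 0, 0
    for value, width in zip((tag, offset, compare, control), _WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit in its allotted bits")
        word |= value << shift
        shift += width
    return word


def unpack_link(word: int) -> Tuple[int, int, int, int]:
    """Return (tag, offset, compare, control) from a packed 32 bit word."""
    fields = []
    for width in _WIDTHS:
        fields.append(word & ((1 << width) - 1))
        word >>= width
    return (fields[0], fields[1], fields[2], fields[3])
```

Because a debug link is described as having a very similar format, the same two helper
functions serve for both link types in this sketch, differing only in the tag passed.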

Target Address Format (TAF)

The TAF is used as a common format for transferring destination addresses. The
representation allows both host and target addresses to be specified in a single format. This is
a requirement to allow host addresses to be specified when calls or branches are made to code
that has not been translated. Moreover, if a host to target address translation fails then this
format allows the host address to be retained. Thus an appropriate host continuation address
can be generated if such a branch is taken.

The format of the TAF is shown in Figure 12. It is designed to be a close subset of the TDF
to allow simple transformation of address links obtained in TDF.

Type 1201 represents a host instruction address stored in bits 1202. The lower two bits
contain the tag of 00. Thus instruction addresses must be word aligned. This is a property that
is generally true for 32 bit RISC architectures. Type 1203 represents a target address. The bits
1205 give the actual address of target code to execute and bits 1204 give the control
attributes.
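
The way a single word can carry either kind of address is sketched below. The text only
states that a host address carries the tag 00; the sketch additionally assumes that a target
address word carries some non-zero tag in the same two bits. The function names are
hypothetical.

```python
from typing import Callable, Optional

HOST_TAG = 0b00  # from the text: host instruction addresses end in 00


def host_to_taf(host_addr: int) -> int:
    """A word aligned host address is already a valid TAF word with tag 00."""
    if host_addr & 0x3:
        raise ValueError("host instruction addresses must be word aligned")
    return host_addr


def is_host_taf(taf: int) -> bool:
    """True if the TAF word holds a host address rather than a target address."""
    return (taf & 0x3) == HOST_TAG


def translate_or_keep(host_addr: int,
                      translate: Callable[[int], Optional[int]]) -> int:
    """Return the target TAF word, or retain the host address if translation fails."""
    target = translate(host_addr)
    return host_to_taf(host_addr) if target is None else target
```

When translate returns None the host address is retained, so a host continuation address
remains available if the branch is taken.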

Debug Environment

Before an application is ever run on real hardware it will have been tested in a simulation
environment. This allows full cycle and bit accurate testing. Stimulus and behavioural
modelling code will be produced to emulate the physical environment that the application will
be executed within. This process will allow the discovery of most major bugs in the
application. Since the simulation runs natively using a C++ environment, the engineer is able
to use his or her favourite debugger and integrated development environment.

Of course, there are always likely to be application level bugs that only manifest themselves in
the real hardware environment. To allow easy analysis of these, the preferred embodiment
supports a powerful debug environment.

The overall debug architecture is illustrated in Figure 10. A remote system 1006 communicates
with the target system via a serial or parallel link 1010. A serial link may be used since high
data speeds are not required and there is a need to minimise the area that the debug hardware
occupies. A remote debugging protocol is run over the link. The remote debugger can send
commands to the system to set breakpoints, read/write memory and read/write registers etc.
The remote debugger will be compatible with the instruction set of the host processor in the
system. The physical interface 1005 links to the blocks within the system. Typically the
physical interface will be compatible with JTAG.

The host processor in the system 1001 will contain a debug control unit 1003 connected to
the debug channel 1007. Typically the debug control unit will contain status and breakpoint
registers. Breakpoint registers allow execution to be halted at a particular instruction address.
The host processor will connect to a number of coprocessors 1002 via a system bus or
coprocessor interface 1009. The coprocessors are running code that has been translated from
the same executable being run by the host processor. Each coprocessor will contain a debug
control unit 1004. These may snoop data 1008 from the same debug channel as the host
processor.

Breakpoint settings intended for the host processor can be detected by the debug control
units 1004. These breakpoints will initially be specified as code addresses relating to the
location of functions on the host processor 1001. The debug control units will use the address
linking and debug linking mechanisms to translate those into an address in the translated code
of a coprocessor. If the function is not mapped to the coprocessor then no mapping will be
located and thus no breakpoint will be set.

The coprocessor contains a number of breakpoint registers in the debug unit 1004. These are
set with the result of the address linking process. These cause the processor to halt if the target
code position of the breakpoint is reached. Execution is halted if a particular atomic block is
reached. This allows breakpoints that halt the machine on the equivalent of a particular host
instruction in the code.

If the program execution were to be stopped on a breakpoint on the boundary between
atomic blocks then all the important register and memory state would be the same as that
observed on the host architecture. Of course, breakpoints can be set on any host instruction.
Reducing the size of an atomic block to a single host instruction would dramatically reduce
optimisation opportunities and thus the performance of the processor.

Breakpointing on an individual host instruction is achieved as follows. A breakpoint is set by
specifying a host instruction on which to halt. This is converted, using the previously
described debug linking means, into the address of a particular atomic block in the translated
code. A breakpoint is set on the coprocessor at the start of that particular atomic block. This
atomic block will be the one immediately preceding the translation of the required host
instruction.

When the breakpoint is detected the execution is continued back onto the host processor. The
register and any modified memory state held on the coprocessor is sent back to the host
processor. The host processor will have had exactly the same breakpoint set. Execution on the
host processor will continue from the first instruction associated with the breakpointed atomic
block. Execution then continues instruction-by-instruction until the precise breakpointed
instruction is reached. In this manner the instruction level state at the breakpoint can be
reproduced from a combination of state generated by the coprocessor and the host processor
itself.
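
The two stage breakpoint scheme can be sketched as follows. The AtomicBlock record is
hypothetical, and the sketch assumes that each atomic block corresponds to one contiguous
range of 4 byte host instructions, which the text does not state explicitly.

```python
from typing import List, NamedTuple, Optional, Tuple

INSTR_BYTES = 4  # assumption: fixed width, word aligned host instructions


class AtomicBlock(NamedTuple):
    host_start: int   # host address of the first instruction in the block
    host_end: int     # host address just past the last instruction (exclusive)
    target_addr: int  # start of the block in the translated code (debug link)


def plan_breakpoint(bp_host_addr: int,
                    blocks: List[AtomicBlock]) -> Optional[Tuple[int, int, int]]:
    """Return (coprocessor breakpoint, host resume address, host single steps)."""
    for block in blocks:
        if block.host_start <= bp_host_addr < block.host_end:
            steps = (bp_host_addr - block.host_start) // INSTR_BYTES
            return (block.target_addr, block.host_start, steps)
    return None  # the code is not mapped to this coprocessor: no breakpoint set
```

The coprocessor halts at the returned target address. The host then resumes at the returned
host address and single steps the given number of instructions to reach the exact breakpoint.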

To allow high levels of parallelism in the architecture, code can be scheduled out-of-order
with respect to the original sequential code. Results may be generated in a completely different
order to the way they are expressed in the original sequential code. The user should not need
to be aware of this. When they are debugging the code and single stepping through it they
expect expressions to be evaluated and results produced in the sequential order expressed in
the original sequential code.

Executable Update 

In the preferred embodiment specialised coprocessors may be generated automatically.
These should interact with the host processor in the system in as seamless a manner as
possible. The software application should be able to run across the combination of both the
host processor and the coprocessors in the system. Certain software functions are marked for
execution on a coprocessor. Whenever the function is called on the host processor the
execution flow should be automatically directed to the coprocessor.

To this end the original host code executable is modified automatically in the preferred
embodiment. The initial instructions in the host code for functions that are being mapped to a
coprocessor are modified to load the address of the function on the coprocessor and branch
to a common handling function. This handling function is responsible for communicating
with the coprocessor. Certain aspects of the host processor state (such as the registers) may be
transferred across to the coprocessor. The coprocessor execution is then initiated from the
required address to execute the translated function. When the function execution is completed
the state is transferred back to the host processor. Execution may then continue on the host
processor. Advantageously, this provides the effect of a transparent offload of the function
onto a coprocessor.

System Architecture

This section describes the options for the system architecture of the preferred embodiment. It
is desirable to provide a shared memory environment where the coprocessors can access the
same address space as the host processor. This allows pointers to be freely passed between the
two environments and allows complex data structures to be shared.

Providing a shared memory environment adds hardware complexity, as caches are required
within the coprocessor that must remain coherent with the contents of other caches in the system.

There are two possible interaction models for the host processors and the coprocessors as 
detailed below: 

Blocking Model 

In the simplest configuration the host processor is blocked while the coprocessor is executing
functions. An illustration of this architecture is given in Figure 11. A host processor 1101
contains a cache 1108 and also an interface to the system bus 1107. The main memory 1103
will be connected to the processor using the system bus 1110. Optionally, the host processor
may have a specialised coprocessor interface 1109. A coprocessor 1102 may be connected to
the host processor either via the system bus and a bus interface unit 1105 or via an optional
interface to a coprocessor port 1106.

It is expected that the host processor contains a cache 1104. For good performance such
caches normally use a write-back rather than a write-through caching mechanism.
Thus data that has been updated is only written back to main memory when the cache line
needs to be evicted.

The coprocessor is implemented as a slave to the host processor. Each coprocessor is
allocated a block of registers in the address map of the bus. These registers can be accessed by
software running on the processor. Transmission of data from the coprocessor to the host
processor is performed via the host processor reading registers stored within the interface.



wo 2004/003730 



PCT/GB2003/002778 




The coprocessors may also have the capability to generate an interrupt to the host processor
in order to handle a critical event or something outside of the normal communication
protocol.

In this model all memory accesses are directed via the host processor. This allows all addresses
handled by the coprocessor to be virtual. Thus the cache 1104 is indexed using virtual
addresses. Memory addresses supplied to the host processor are automatically translated into
physical addresses using the address translation mechanisms already implemented by the host
processor.

When the host processor timeline 1112 encounters a function that is being executed by the
coprocessor the register state is passed to the coprocessor 1114. The coprocessor timeline 1113
shows the execution of the functions. As soon as the initiation is received from the host processor
the coprocessor leaves its sleep state 1119. While the coprocessor is running the host
processor is blocked 1116 waiting for requests from the coprocessor. During its execution the
coprocessor will initiate requests 1117 if there are cache misses. These requests will be handled
by the host processor and those which cannot be satisfied from data in the coprocessor will
result in transactions to the main memory 1118. The main memory timeline 1111 shows the
memory being idle 1120 unless it receives a transaction request.

When the end of the function executed is reached any dirty data in the cache 1104 is written
back 1115. The coprocessor can then re-enter its sleeping state 1119.

Non-Blocking Implementation

The non-blocking model provides a more complex interaction between the host processor
and the coprocessor. In this model the host processor may continue and perform other tasks
while the coprocessor is operational. This relies on the coprocessor being able to become a
bus master and initiate memory accesses directly. Since the coprocessor must be able to
initiate memory accesses using physical addresses it needs to be able to perform a virtual to
physical address translation.

The model relies on the use of threads in the application program running on the host
processor. When a particular thread encounters a function that has been mapped onto a
particular coprocessor the thread effectively transfers onto the coprocessor. The host
processor is therefore freed to continue running other threads.




An example configuration is shown in Figure 9. A host processor 901 contains a bus interface
unit 904 that interfaces to the system bus 908. The main memory 903 is connected to the
system bus. The host processor also contains a cache 905 and a Translation Lookaside Buffer
(TLB) 919. This contains a cache of translations between virtual addresses and physical
addresses in the memory system. A coprocessor 902 is connected to the host processor via a
bus interface unit 906. The coprocessor also contains a write back cache 904.

The memory address translations must be coherent 920 between the host processor and a
TLB held by the coprocessor in the bus interface unit 906. This TLB is used for mapping
virtual to physical addresses before initiating a memory transaction with the main memory
903.

A shared virtual memory system also requires management of the entries within the TLB. In
this configuration it is assumed that the host processor is running an operating system that
determines when to page in and page out particular blocks of virtual memory in the physical
address space. Whenever there is a miss in the TLB of a coprocessor an interrupt to the host
processor may be generated. This causes the required virtual page to be looked up in the page
tables (bringing the data in from secondary storage if required) and the physical page address
transmitted to the coprocessor where it is stored for future usage in the TLB. Moreover, if a
physical page is ever reclaimed by the operating system for use by another virtual page then
the corresponding entries in all coprocessor TLBs must be evicted. This is done using a
broadcast message sent from the host processor. Thus this mechanism requires changes to be
made to the memory management handling routines within the kernel of the operating
system.
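
The TLB miss and eviction protocol can be modelled in software as follows. This is a
behavioural sketch only: the class names, the 4 KiB page size and the use of dictionaries
for the page table and the TLB are assumptions, and paging in from secondary storage is
omitted.

```python
from typing import Dict, List

PAGE_SHIFT = 12                    # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class HostOS:
    """Hypothetical model of the host side page table handling."""

    def __init__(self, page_table: Dict[int, int]) -> None:
        self.page_table = dict(page_table)   # virtual page -> physical page
        self.coprocessor_tlbs: List["CoprocessorTLB"] = []
        self.miss_interrupts = 0

    def handle_tlb_miss(self, vpage: int) -> int:
        # Interrupt handler: look the page up and return its physical page.
        self.miss_interrupts += 1
        return self.page_table[vpage]

    def reclaim_physical_page(self, ppage: int) -> None:
        # Drop the mapping, then broadcast the eviction to every coprocessor.
        self.page_table = {v: p for v, p in self.page_table.items() if p != ppage}
        for tlb in self.coprocessor_tlbs:
            tlb.evict_physical_page(ppage)


class CoprocessorTLB:
    """Hypothetical model of the TLB held in a coprocessor bus interface unit."""

    def __init__(self, host: HostOS) -> None:
        self.host = host
        self.entries: Dict[int, int] = {}
        host.coprocessor_tlbs.append(self)

    def translate(self, vaddr: int) -> int:
        vpage = vaddr >> PAGE_SHIFT
        if vpage not in self.entries:        # miss: ask the host processor
            self.entries[vpage] = self.host.handle_tlb_miss(vpage)
        return (self.entries[vpage] << PAGE_SHIFT) | (vaddr & PAGE_MASK)

    def evict_physical_page(self, ppage: int) -> None:
        self.entries = {v: p for v, p in self.entries.items() if p != ppage}
```

In this model a second access to the same page is served from the coprocessor TLB without
interrupting the host, and reclaiming a physical page removes the stale entry from every
registered coprocessor TLB.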

The host processor timeline 910 is shown executing a first thread 914. If this thread
encounters a function that should be executed on the coprocessor then any dirty data in the
host processor cache is first written back to main memory 913. The coprocessor is then
initiated by transferring register state across to it 917. The coprocessor timeline 911 is diverted
at that point to start execution 912 of the functions. The host processor initiates another
thread 915 that may execute while the first thread is being executed on the coprocessor. Cache
misses 918 in the coprocessor initiate direct transactions with the main memory 909. Issues of
coherence between the host processor and coprocessor are dealt with by the standard thread
synchronisation requirements for shared memory. When the execution of the coprocessor
functions is complete any dirty data can be written back to main memory 916 and the host
processor is able to proceed with the original thread 914.

It is understood that there are many possible alternative embodiments of the invention. It is
recognized that the description contained herein is only one possible embodiment. This
should not be taken as a limitation of the scope of the invention. The scope should be defined
by the claims and we therefore assert as our invention all that comes within the scope and
spirit of those claims.



